Summary Paper

Introduction

While the relationship between economic inequality and health in general has been extensively studied, less attention has been paid to the relationship with mental health specifically. Further, the results seem sensitive to the geographic locations studied and the measures used. As a result, although the research area is not unexplored, there is still room for novel empirical investigations.

As an initial step, we conducted a literature review to better understand the scholarly work on the topic. The most helpful articles were two systematic reviews that sought to summarize the relationship between income inequality and mental health. First, Ribeiro et al. (2017) conducted a meta-analysis of 27 studies exploring the relationship between income inequality and the incidence of mental health problems. Of these articles, 9 found a positive association, 10 found mixed results, and 8 found no association. The authors concluded, “Income inequality negatively affects mental health but the effect sizes are small and there is marked heterogeneity among studies.” Second, Tibber et al. (2022) systematically reviewed 42 subnational studies (i.e., within countries rather than across countries), with a plurality of these studies focusing on US populations. After controlling for absolute deprivation, approximately 55 percent of the studies supported an association between higher inequality and poorer mental health, 12 percent supported an association between higher inequality and better mental health, and the remaining studies (roughly 1/3) did not establish a conclusive association. The results were robust to mental health conditions, geographical unit size, and across countries with different incomes.

The upshot of this literature review was twofold. First, the nature of the relationship between income inequality and mental health has not been conclusively established. Second, the relationship might differ based on the geographic area being considered. Therefore, we believe this topic is worthy of further study.

In addition, the research we gathered contributed to the development of our SMART questions. Primarily, we discovered that the relationship could vary by geography, making it important to examine the data for different parts of the U.S. Throughout the investigation of our topic, we developed the following SMART questions:

  • What is the relationship between the prevalence of mental health problems and income inequality across U.S. counties from 2016 to 2021?
  • Does the relationship differ across regions of the U.S.?

Overall, we found that the prevalence of mental health conditions seems positively associated with greater economic inequality in the U.S. This relationship also seems to be different depending on the region.

The primary dataset used for this project is a combination of annual datasets called the County Health Rankings and Roadmaps National Data and was obtained from the Robert Wood Johnson Foundation (RWJF). The Robert Wood Johnson Foundation is a non-profit public health foundation based in New Jersey that collects data on various socioeconomic and public health indicators. The County Health Rankings and Roadmaps National Data consists of county-level socioeconomic and public health data. The Robert Wood Johnson Foundation collects data from the National Center for Health Statistics, the Centers for Disease Control and Prevention (CDC) Behavioral Risk Factor Surveillance System, the National Center for HIV/AIDS, Viral Hepatitis, STD, and TB Prevention, USDA Food Environment Atlas, the Centers for Medicare & Medicaid Services (CMS) National Provider Identifier Standard, the Stanford Education Data Archive, and the U.S. Census Bureau (RWJF, 2021).

The full dataset includes data from 2011 to 2021. For this project, we used the subset of the data from 2016 to 2021. This time period was chosen because it provided the most complete data for all the variables of interest. On initial inspection, the annual datasets for years prior to 2016 were incomplete or did not have variables that exactly corresponded to the variables in 2016 to 2021.

The full dataset includes thirty-three variables, which are listed below with a description of each. However, this project was primarily concerned with two mental health variables and two economic inequality variables. The primary economic inequality variables of interest are household income ratio and median income. These two economic inequality variables were used as independent variables. The primary mental health variables of interest are the number of mentally unhealthy days and the percentage of frequent mental distress. These two mental health variables were used as dependent variables. The table below displays the summary statistics of the dataset.

The income ratio variable, also called inequality or income inequality in the exploratory data analysis, is the ratio of household income at the 80th percentile to household income at the 20th percentile. Income is defined as the sum of the amounts reported separately for wages, salary, net self-employment income, interest, dividends, royalty income from estates/trusts, Social Security, Railroad Retirement income, Supplemental Security Income (SSI), public assistance/welfare payments, retirement, and survivor/disability pensions. A higher inequality ratio indicates a greater gap between the top and bottom of the income spectrum. The RWJF obtained this data from the American Community Survey (ACS), which is conducted by the U.S. Census Bureau and produces population and housing information (RWJF, 2021). The total number of observations for the income ratio variable is 18,465.
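The 80/20 ratio described above can be illustrated with a minimal sketch. The incomes below are hypothetical values for a single made-up county, and `quantile()` with its default interpolation is only a stand-in; the ACS may estimate percentiles differently.

```r
# Hypothetical household incomes for one county (illustrative values only)
incomes <- c(12000, 18000, 25000, 32000, 41000,
             50000, 62000, 78000, 95000, 140000)

# Household income at the 80th and 20th percentiles
# (base R default interpolation; an assumption, not the ACS method)
p80 <- quantile(incomes, 0.80, names = FALSE)
p20 <- quantile(incomes, 0.20, names = FALSE)

# Income ratio: larger values mean a wider gap between high and low incomes
income_ratio <- p80 / p20
income_ratio
```

A ratio of 1 would mean the 80th and 20th percentile households have the same income; the county-level values in this dataset are all well above that.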

The median income variable, also called median_inc in the exploratory data analysis, is the income in the middle of the household income distribution of the county. Half of the household incomes in a county are above the median and half are below it. Income is defined as the sum of the amounts reported separately for wages, salary, net self-employment income, interest, dividends, royalty income from estates/trusts, Social Security, Railroad Retirement income, Supplemental Security Income (SSI), public assistance/welfare payments, retirement, and survivor/disability pensions. The RWJF obtained this data from the American Community Survey (ACS), which is conducted by the U.S. Census Bureau and produces population and housing information (RWJF, 2021). The total number of observations for the median income variable is 18,469.

The number of mentally unhealthy days, also called mental_health_days in the exploratory data analysis, is the average number of mentally unhealthy days reported in the past 30 days. Mentally unhealthy days is self-reported by respondents answering the question, “Now thinking about your mental health, which includes stress, depression, and problems with emotions, for how many days during the past 30 days was your mental health not good?” The RWJF obtained this data from the CDC’s Behavioral Risk Factor Surveillance System (BRFSS). The BRFSS is a health survey conducted by the CDC that collects information on health-related behavioral risk factors (RWJF, 2021). The total number of observations for the number of mentally unhealthy days variable is 18,469.

The percentage of frequent mental distress, also called mental_distress_rate in the exploratory data analysis, is the percentage of adults who reported 14 or more days as mentally unhealthy. Mentally unhealthy days is self-reported by respondents answering the question, “Now thinking about your mental health, which includes stress, depression, and problems with emotions, for how many days during the past 30 days was your mental health not good?” The estimate of the percentage of frequent mental distress is based on the measurement of the number of mentally unhealthy days variable above. The RWJF obtained this data from the CDC’s Behavioral Risk Factor Surveillance System (BRFSS). The BRFSS is a health survey conducted by the CDC that collects information on health-related behavioral risk factors. The total number of observations for the percentage of frequent mental distress is 18,469.
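To make the link between the two mental health measures concrete, here is a small sketch using hypothetical survey responses. It ignores survey weights and the age adjustment applied to the published estimates, so it only illustrates the definitions, not the BRFSS estimation procedure.

```r
# Hypothetical answers to the "how many days in the past 30" question
days_reported <- c(0, 0, 2, 5, 30, 14, 0, 3, 20, 1)

# Analogue of mental_health_days: average mentally unhealthy days
mean_days <- mean(days_reported)

# Analogue of mental_distress_rate: share reporting 14 or more days
distress_rate <- mean(days_reported >= 14)

mean_days
distress_rate
```

Both measures come from the same question, which is why they tend to move together across counties.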

For the exploratory data analysis, the data was also separated into regions based on state. This project separated the states into regions as defined by the U.S. Census Bureau. The states were divided into four regions: the Midwest, Northeast, South, and West. The states in the West include Washington, Oregon, California, Montana, Idaho, Wyoming, Nevada, Utah, Colorado, Arizona, New Mexico, Alaska, and Hawaii. The states in the Midwest include North Dakota, South Dakota, Nebraska, Kansas, Minnesota, Iowa, Missouri, Wisconsin, Illinois, Indiana, Michigan, and Ohio. The states in the South include Texas, Oklahoma, Arkansas, Louisiana, Mississippi, Alabama, Tennessee, Kentucky, Virginia, West Virginia, Georgia, Florida, North Carolina, South Carolina, District of Columbia, Maryland, and Delaware. Finally, the states in the Northeast include Pennsylvania, New York, New Jersey, Connecticut, Massachusetts, Rhode Island, Vermont, New Hampshire, and Maine (U.S. Census Bureau, 2010).
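One way to reproduce this state-to-region mapping is sketched below. It relies on the `state.abb` and `state.region` vectors that ship with base R; the helper name `state_to_region` is hypothetical, and this is not necessarily how the region column in our dataset was built.

```r
# Base R provides state.abb and state.region for the 50 states (no DC)
region_lookup <- as.character(state.region)
names(region_lookup) <- state.abb

# Base R labels the Midwest "North Central"; rename it to match this report
region_lookup[region_lookup == "North Central"] <- "Midwest"

# DC is not in state.abb; the Census Bureau places it in the South
region_lookup["DC"] <- "South"

# Hypothetical helper: map two-letter abbreviations to a census region
state_to_region <- function(abbr) unname(region_lookup[abbr])

state_to_region(c("CA", "OH", "DC", "PA"))
table(region_lookup)
```

The resulting counts per region can be checked against the state lists in the paragraph above.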

The variables in the data set are:

  • region: name of the US Census Bureau region
  • division: name of the US Census Bureau division (contained within a census region)
  • state: two letter state abbreviation
  • statecode: FIPS state code
  • countycode: FIPS county code
  • fipscode: 5-digit FIPS Code (county-level); combines statecode and countycode
  • county: county name
  • year: report release year from County Health Rankings; range of 2016-2021
  • county_ranked: Indicates whether or not the county was ranked; 0=unranked, 1=ranked, or NA for aggregated national or state-level data
  • mental_health_days: Average number of mentally unhealthy days reported in past 30 days (age-adjusted)
  • mental_distress_rate: Percentage of adults reporting 14 or more days of poor mental health per month
  • inequality: Ratio of household income at the 80th percentile to income at the 20th percentile (Income inequality)
  • median_inc: The income where half of households in a county earn more and half of households earn less
  • hs_grad: Percentage of adults ages 25 and over with a high school diploma or equivalent
  • college: Percentage of adults ages 25-44 with some post-secondary education
  • unempl: Percentage of population ages 16 and older unemployed but seeking work
  • child_poverty: Percentage of people under age 18 in poverty
  • single_parent: Percentage of children that live in a household headed by single parent
  • severe_housing: Percentage of households with severe housing problems
  • food_index: Index of factors that contribute to a healthy food environment, from 0 (worst) to 10 (best)
  • mh_providers: rate of providers to 100,000 population
  • pop_provider_ratio: ratio of population to mental health providers (i.e., population served per provider)
  • pop: census population estimate
  • pct_below18: percent of population younger than 18
  • pct_black: percent of population that are African-American or non-Hispanic Black
  • pct_native_am: percent of population that are Native American or Alaska Natives
  • pct_asian: percent of population that are Asian
  • pct_pacific: percent of population that are Native Hawaiian or Other Pacific Islander
  • pct_hispanic: percent of population that are Hispanic
  • pct_white: percent of population that are non-Hispanic white or Caucasian
  • pct_female: percent of population that are female
  • pct_rural: percent of population that live in rural areas

For more information, see the measures online.
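As noted in the variable list, fipscode combines statecode and countycode. A minimal sketch of that combination follows; the helper `make_fips` is hypothetical and assumes the two codes are available as plain numbers (a factor column would need `as.character()` first, since `as.integer()` on a factor returns level indices).

```r
# Hypothetical helper: build a 5-digit county FIPS code from its two parts
# Assumes numeric inputs: 2-digit state code, 3-digit county code
make_fips <- function(statecode, countycode) {
  sprintf("%02d%03d", as.integer(statecode), as.integer(countycode))
}

# Zero-padding keeps codes such as state 1, county 1 at five characters
make_fips(c(6, 36, 1), c(37, 61, 1))
```

Keeping the result as a character string preserves leading zeros, which would be lost if the code were stored as a number.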

# look at county_ranked var; not all counties are ranked; also some aggregated data per state and country exist in the observations
# =1 means they are ranked, =0 means unranked, and =NA is for state/national data
#print(summary(dframe$county_ranked))

# subset of dataframe including only ranked counties
ranked <- dframe %>% subset(county_ranked==1)

# subset of dataframe including only unranked counties
unranked <- dframe %>% subset(county_ranked==0)

# subset of dataframe including only aggregated data
aggregated <- dframe %>% subset(is.na(county_ranked))

# duplicate column and rename level labels for easier reading
ranked$region_abb <- ranked$region
levels(ranked$region_abb) <- c("", 
                              "MW",  # re-level factor labels
                              "NE",
                              "S", 
                              "W")

# subset ranked data by region
ranked_MW <- ranked %>% subset(region=="Midwest")
ranked_NE <- ranked %>% subset(region=="Northeast")
ranked_SO <- ranked %>% subset(region=="South")
ranked_WE <- ranked %>% subset(region=="West")

# subset ranked data into annual datasets
ranked16 <- ranked %>% subset(year==2016)
ranked17 <- ranked %>% subset(year==2017)
ranked18 <- ranked %>% subset(year==2018)
ranked19 <- ranked %>% subset(year==2019)
ranked20 <- ranked %>% subset(year==2020)
ranked21 <- ranked %>% subset(year==2021)

# sort dataframe
ranked <- ranked[order(ranked$year, ranked$region, ranked$division, ranked$statecode, ranked$countycode), ]
# groupby year
ranked_by_year <- ranked %>% group_by(year)

summ_by_year <- ranked_by_year %>%  summarise(num_counties = n_distinct(fipscode), 
                              num_states = n_distinct(statecode),
                              wmean_inequality = weighted.mean(inequality, pop, na.rm=T), 
                              wmean_mh_rate = weighted.mean(mental_distress_rate, pop, na.rm=T), 
                              wmean_mh_days = weighted.mean(mental_health_days, pop, na.rm=T), 
                              wmean_unempl = weighted.mean(unempl, pop, na.rm=T), 
                              wmean_medinc = weighted.mean(median_inc, pop, na.rm=T),
                              )

xkabledplyhead(summ_by_year, 6, title = "Table: Data grouped by Year")
Table: Data grouped by Year
year  num_counties  num_states  wmean_inequality  wmean_mh_rate  wmean_mh_days  wmean_unempl  wmean_medinc
2016  3075          51          4.73              0.110          3.61           0.0626        56225
2017  3071          51          4.75              0.113          3.71           0.0537        58214
2018  3078          51          4.74              0.116          3.80           0.0495        60385
2019  3081          51          4.74              0.116          3.80           0.0443        62851
2020  3084          51          4.72              0.123          4.00           0.0396        65104
2021  3081          51          4.70              0.137          4.41           0.0373        68859

# groupby year and state
ranked_by_year_state <- ranked %>% group_by(year, statecode)

summ_by_year_state <- ranked_by_year_state %>% summarise(num_obs = n(), 
                                   state_name = first(state),
                                   region = first(region),
                                   wmean_inequality = weighted.mean(inequality, 
                                                                    pop, na.rm = T),
                                   med_inequality = median(inequality, na.rm = T),
                                   wmean_mh_rate = weighted.mean(mental_distress_rate, 
                                                                 pop, na.rm=T),
                                   med_mh_rate = median(mental_distress_rate, na.rm=T),
                                   wmean_mh_days = weighted.mean(mental_health_days, 
                                                                 pop, na.rm=T), 
                                   med_mh_days = median(mental_health_days, na.rm=T),
                                   wmean_unempl = weighted.mean(unempl, 
                                                                pop, na.rm=T), 
                                   med_unempl = median(unempl, na.rm=T),
                                   wmean_medinc = weighted.mean(median_inc, 
                                                                pop, na.rm=T),
                                   med_medinc = median(median_inc, na.rm=T)
                                   )


# groupby region
ranked_by_region <- ranked %>% group_by(region)

summ_by_region <- ranked_by_region %>%  summarise(num_counties = n_distinct(fipscode), 
                              num_states = n_distinct(statecode),
                              wmean_inequality = weighted.mean(inequality, pop, na.rm=T), 
                              wmean_mh_rate = weighted.mean(mental_distress_rate, pop, na.rm=T), 
                              wmean_mh_days = weighted.mean(mental_health_days, pop, na.rm=T), 
                              wmean_unempl = weighted.mean(unempl, pop, na.rm=T), 
                              wmean_medinc = weighted.mean(median_inc, pop, na.rm=T),
                              )

xkabledplyhead(summ_by_region, 4, title = "Table: Data grouped by Region")
Table: Data grouped by Region
region     num_counties  num_states  wmean_inequality  wmean_mh_rate  wmean_mh_days  wmean_unempl  wmean_medinc
Midwest    1037          12          4.50              0.117          3.82           0.0453        58907
Northeast  217           9           5.13              0.117          3.88           0.0478        68950
South      1414          17          4.70              0.125          4.02           0.0470        56914
West       437           13          4.69              0.114          3.76           0.0511        67705

# groupby year and region
ranked_by_year_region <- ranked %>% group_by(year, region)

summ_by_year_region <- ranked_by_year_region %>% summarise(num_obs = n(), 
                                   region_name = first(region),
                                   sum_pop_millions = sum(pop/1000000, na.rm = T),
                                   wmean_inequality = weighted.mean(inequality, pop, na.rm = T),
                                   wmean_mh_rate = weighted.mean(mental_distress_rate, pop, na.rm=T),
                                   wmean_mh_days = weighted.mean(mental_health_days, pop, na.rm=T)
                                   )

Our data are identified at the county-year level. Our data set contains 19164 observations and 32 variables, although 319 of these observations are for aggregated data at the national or state level. In addition, 375 observations are for counties that are unranked by the University of Wisconsin Population Health Institute, suggesting that the data for these counties is less reliable.

In total, we have 18470 observations in the ranked data, combined across 6 annual reports (2016–2021). Each annual data set has between 3071 and 3084 observations, indicating that the number of ranked counties is consistent over time. We also grouped the data on several dimensions to help us better investigate time and geographic patterns in our data.
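The split into ranked, unranked, and aggregated observations can be sanity-checked with a short sketch. The data frame `toy` is a made-up stand-in for the full data, using the same coding of county_ranked (1, 0, or NA).

```r
# Toy stand-in for the full data: county_ranked is 1, 0, or NA
toy <- data.frame(
  fipscode      = c("01001", "01003", "02013", "01000", "00000"),
  county_ranked = c(1, 1, 0, NA, NA)
)

n_ranked     <- sum(toy$county_ranked == 1, na.rm = TRUE)
n_unranked   <- sum(toy$county_ranked == 0, na.rm = TRUE)
n_aggregated <- sum(is.na(toy$county_ranked))

# The three groups should partition the rows with nothing left over
stopifnot(n_ranked + n_unranked + n_aggregated == nrow(toy))

# The same identity holds for the counts reported in the text
18470 + 375 + 319 == 19164
```

Using `na.rm = TRUE` matters here because comparing NA to 0 or 1 yields NA rather than FALSE.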

First, we grouped the data set by report year and calculated the population-weighted means for several variables. In general, the mean income inequality has stayed relatively constant over time, while the mental health variables have risen slightly. Other economic variables, such as unemployment and median income, have moved in positive directions (unemployment has fallen, while median incomes have increased).
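To show why population weighting matters, the sketch below compares a simple mean with a population-weighted mean on three hypothetical counties. The values are invented for illustration and are not taken from our data.

```r
# Three hypothetical counties: one is much larger than the other two
toy <- data.frame(
  county     = c("A", "B", "C"),
  inequality = c(4.0, 5.0, 6.0),
  pop        = c(1000, 1000, 8000)
)

# Unweighted mean treats every county equally
simple_mean <- mean(toy$inequality)

# Weighted mean gives each county influence in proportion to its population
weighted <- weighted.mean(toy$inequality, toy$pop)

# Same quantity written out by hand: sum(x * w) / sum(w)
manual <- sum(toy$inequality * toy$pop) / sum(toy$pop)

c(simple = simple_mean, weighted = weighted, manual = manual)
```

Because county C holds most of the population, the weighted mean is pulled toward its value, which is closer to what a typical resident experiences than the unweighted county average.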

Next, we grouped the data by year and state and calculated population-weighted means and state-level medians. There was a lot more variation among the data at this level, which will be helpful when plotting aggregated data later.

Finally, we grouped the data by region, giving us a sense of the size of the 4 census regions: the South contains the most states and counties and has the largest population, while the Northeast is the smallest on these measures. The Northeast has the highest measured inequality, and the Midwest has the lowest. The mental health variables are highest (i.e., worst) in the South and lowest in the West. These general trends hold up when grouping by year and region. These differences suggest that comparing the relationship between the variables across regions would be worthwhile.

Limitations of the Dataset

The main limitation of this dataset is that it is largely based on survey data. The income ratio and median income are based on a sample surveyed by the U.S. Census Bureau. The number of mentally unhealthy days and the percentage of frequent mental distress are based on a self-reported sample survey conducted by the CDC. While both the U.S. Census Bureau and the CDC are reputable institutions and reliable sources of data, there is always a possibility that a sample is biased, that a survey is improperly conducted, or that survey methods change from year to year.

Additionally, given that the data is collected at the county level, local factors influence the type of information collected, making comparisons across state lines questionable given the different geographical, social, and economic contexts. Furthermore, self-reported mental health cannot be validated against medical records. Finally, the data for each variable is not normally distributed. This can be seen in the plotted histograms and normal Q-Q plots.
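As a rough numeric companion to a normal Q-Q plot, the sketch below computes the correlation between sample quantiles and theoretical normal quantiles. The helper `qq_cor` and the simulated samples are assumptions made for illustration; they are not part of our analysis.

```r
set.seed(7)

# Simulated samples (not our data): one roughly normal, one right-skewed
normal_sample <- rnorm(500)
skewed_sample <- rexp(500)

# Hypothetical helper: correlation of the points on a normal Q-Q plot
# Values close to 1 mean the points lie near a straight line
qq_cor <- function(x) {
  qq <- qqnorm(x, plot.it = FALSE)
  cor(qq$x, qq$y)
}

c(normal = qq_cor(normal_sample), skewed = qq_cor(skewed_sample))
```

A skewed variable bends away from the straight line on a Q-Q plot, which shows up here as a lower correlation than for the normal sample.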

Exploratory Data Analysis

After establishing our initial questions and taking a preliminary look at our data, we conducted exploratory data analysis to investigate our research questions.

Summary Statistics

We examined the summary statistics for all the variables in our data set. We also took a look at a smaller subset of variables most relevant to our analysis.

# Summary statistics of all variables in the ranked subset of dframe
# psych::describe() is what describeBy() falls back to when no group is given
table1 <- describe(ranked, type = 1) ## type of kurtosis and skewness to calculate
table1 %>%
  kbl(caption="Summary Statistics for Master Dataset",
       format= "html", col.names = c("Var Num.","Count","Mean","Std. Dev.","Median", "Trimmed Mean", "Mad", "Minimum", "Maximum","Range", "Skewness", "Kurtosis", "S.E."),
      align="r") %>%
   kable_classic_2(full_width = F, html_font = "helvetica")
Summary Statistics for Master Dataset
Variable Var Num. Count Mean Std. Dev. Median Trimmed Mean Mad Minimum Maximum Range Skewness Kurtosis S.E.
region* 1 18470 3.40e+00 1.09e+00 4.00e+00 3.37e+00 1.48e+00 2.000 5.00e+00 3.00e+00 -0.198 -1.396 0.008
division* 2 18470 6.50e+00 2.88e+00 8.00e+00 6.63e+00 2.96e+00 2.000 1.00e+01 8.00e+00 -0.379 -1.413 0.021
state* 3 18470 2.70e+01 1.42e+01 2.60e+01 2.73e+01 1.78e+01 1.000 5.10e+01 5.00e+01 -0.028 -1.205 0.105
statecode* 4 18470 2.82e+01 1.43e+01 2.70e+01 2.85e+01 1.78e+01 2.000 5.20e+01 5.00e+01 -0.049 -1.214 0.105
countycode* 5 18470 6.55e+01 5.82e+01 5.30e+01 5.61e+01 4.45e+01 2.000 3.29e+02 3.27e+02 1.949 4.759 0.428
fipscode* 6 18469 1.60e+03 9.23e+02 1.59e+03 1.60e+03 1.18e+03 3.000 3.20e+03 3.20e+03 0.010 -1.204 6.793
county* 7 18470 9.43e+02 5.31e+02 9.45e+02 9.40e+02 6.61e+02 1.000 1.89e+03 1.89e+03 0.046 -1.123 3.908
year* 8 18470 3.50e+00 1.71e+00 4.00e+00 3.50e+00 1.48e+00 1.000 6.00e+00 5.00e+00 -0.002 -1.268 0.013
county_ranked* 9 18470 2.00e+00 0.00e+00 2.00e+00 2.00e+00 0.00e+00 2.000 2.00e+00 0.00e+00 NaN NaN 0.000
mental_health_days 10 18469 4.04e+00 6.92e-01 4.02e+00 4.03e+00 7.06e-01 2.100 7.29e+00 5.19e+00 0.223 -0.112 0.005
mental_distress_rate 11 18469 1.26e-01 2.40e-02 1.24e-01 1.25e-01 2.40e-02 0.066 2.47e-01 1.81e-01 0.532 0.302 0.000
inequality 12 18465 4.52e+00 7.33e-01 4.41e+00 4.45e+00 6.28e-01 2.543 1.20e+01 9.43e+00 1.254 3.433 0.005
median_inc 13 18469 5.08e+04 1.36e+04 4.87e+04 4.93e+04 1.10e+04 21658.000 1.52e+05 1.30e+05 1.424 3.681 99.952
hs_grad 14 16563 8.70e-01 7.90e-02 8.83e-01 8.78e-01 6.70e-02 0.025 1.00e+00 9.75e-01 -1.593 6.583 0.001
college 15 18469 5.71e-01 1.16e-01 5.72e-01 5.72e-01 1.22e-01 0.152 9.11e-01 7.59e-01 -0.088 -0.263 0.001
unempl 16 18469 5.00e-02 2.00e-02 4.60e-02 4.80e-02 1.70e-02 0.012 2.40e-01 2.28e-01 1.690 6.492 0.000
child_poverty 17 18469 2.21e-01 9.10e-02 2.09e-01 2.14e-01 9.00e-02 0.024 7.47e-01 7.23e-01 0.695 0.537 0.001
single_parent 18 18469 3.14e-01 1.06e-01 3.06e-01 3.08e-01 9.60e-02 0.000 8.72e-01 8.72e-01 0.696 1.251 0.001
severe_housing 19 18470 1.42e-01 4.70e-02 1.37e-01 1.39e-01 3.70e-02 0.022 7.13e-01 6.91e-01 2.086 14.166 0.000
food_index 20 18394 7.32e+00 1.18e+00 7.50e+00 7.44e+00 1.04e+00 0.000 1.00e+01 1.00e+01 -1.386 3.743 0.009
mh_providers 21 17028 1.00e-03 2.00e-03 1.00e-03 1.00e-03 1.00e-03 0.000 2.40e-02 2.40e-02 3.457 24.730 0.000
pop_provider_ratio 22 17028 2.00e+03 2.84e+03 9.90e+02 1.38e+03 8.97e+02 -957.000 5.49e+04 5.58e+04 4.291 37.897 21.797
pop 23 18469 1.05e+05 3.34e+05 2.67e+04 4.40e+04 2.79e+04 810.000 1.02e+07 1.02e+07 13.652 310.896 2457.693
pct_below18 24 18469 2.23e-01 3.40e-02 2.22e-01 2.22e-01 2.80e-02 0.051 4.20e-01 3.69e-01 0.500 2.410 0.000
pct_black 25 18469 9.10e-02 1.44e-01 2.30e-02 5.70e-02 2.80e-02 0.000 8.59e-01 8.59e-01 2.265 5.074 0.001
pct_native_am 26 18469 2.30e-02 7.70e-02 6.00e-03 8.00e-03 5.00e-03 0.000 9.31e-01 9.31e-01 7.643 66.364 0.001
pct_asian 27 18469 1.50e-02 2.80e-02 7.00e-03 9.00e-03 5.00e-03 0.000 4.30e-01 4.30e-01 6.880 66.735 0.000
pct_pacific 28 18469 1.00e-03 4.00e-03 1.00e-03 1.00e-03 1.00e-03 0.000 1.31e-01 1.31e-01 21.327 545.017 0.000
pct_hispanic 29 18469 9.40e-02 1.37e-01 4.20e-02 6.10e-02 3.60e-02 0.004 9.64e-01 9.59e-01 3.100 11.051 0.001
pct_white 30 18469 7.63e-01 2.01e-01 8.36e-01 7.94e-01 1.60e-01 0.027 9.81e-01 9.55e-01 -1.185 0.796 0.001
pct_female 31 18469 4.99e-01 2.20e-02 5.03e-01 5.02e-01 1.10e-02 0.265 5.70e-01 3.05e-01 -3.180 17.262 0.000
pct_rural 32 18447 5.78e-01 3.12e-01 5.87e-01 5.90e-01 3.82e-01 0.000 1.00e+00 1.00e+00 -0.131 -1.121 0.002
region_abb* 33 18470 3.40e+00 1.09e+00 4.00e+00 3.37e+00 1.48e+00 2.000 5.00e+00 3.00e+00 -0.198 -1.396 0.008
# Summary statistics of the variables most relevant to the analysis below
dframe2 <- subset(ranked, select = c("mental_health_days", "mental_distress_rate", "inequality", "median_inc", "hs_grad", "college", "unempl", "child_poverty", "single_parent", "severe_housing", "food_index", "mh_providers", "pop_provider_ratio"))
# psych::describe() is what describeBy() falls back to when no group is given
table2 <- describe(dframe2, type = 1)
table2 %>%
  kbl(caption="Summary Statistics for Relevant Quantitative Variables",
       format= "html", col.names = c("Var Num.","Count","Mean","Std. Dev.","Median", "Trimmed Mean", "Mad", "Minimum", "Maximum","Range", "Skewness", "Kurtosis", "S.E."),
      align="r") %>%
   kable_classic_2(full_width = F, html_font = "helvetica")
Summary Statistics for Relevant Quantitative Variables
Variable Var Num. Count Mean Std. Dev. Median Trimmed Mean Mad Minimum Maximum Range Skewness Kurtosis S.E.
mental_health_days 1 18469 4.04e+00 6.92e-01 4.02e+00 4.03e+00 7.06e-01 2.100 7.29e+00 5.19e+00 0.223 -0.112 0.005
mental_distress_rate 2 18469 1.26e-01 2.40e-02 1.24e-01 1.25e-01 2.40e-02 0.066 2.47e-01 1.81e-01 0.532 0.302 0.000
inequality 3 18465 4.52e+00 7.33e-01 4.41e+00 4.45e+00 6.28e-01 2.543 1.20e+01 9.43e+00 1.254 3.433 0.005
median_inc 4 18469 5.08e+04 1.36e+04 4.87e+04 4.93e+04 1.10e+04 21658.000 1.52e+05 1.30e+05 1.424 3.681 99.952
hs_grad 5 16563 8.70e-01 7.90e-02 8.83e-01 8.78e-01 6.70e-02 0.025 1.00e+00 9.75e-01 -1.593 6.583 0.001
college 6 18469 5.71e-01 1.16e-01 5.72e-01 5.72e-01 1.22e-01 0.152 9.11e-01 7.59e-01 -0.088 -0.263 0.001
unempl 7 18469 5.00e-02 2.00e-02 4.60e-02 4.80e-02 1.70e-02 0.012 2.40e-01 2.28e-01 1.690 6.492 0.000
child_poverty 8 18469 2.21e-01 9.10e-02 2.09e-01 2.14e-01 9.00e-02 0.024 7.47e-01 7.23e-01 0.695 0.537 0.001
single_parent 9 18469 3.14e-01 1.06e-01 3.06e-01 3.08e-01 9.60e-02 0.000 8.72e-01 8.72e-01 0.696 1.251 0.001
severe_housing 10 18470 1.42e-01 4.70e-02 1.37e-01 1.39e-01 3.70e-02 0.022 7.13e-01 6.91e-01 2.086 14.166 0.000
food_index 11 18394 7.32e+00 1.18e+00 7.50e+00 7.44e+00 1.04e+00 0.000 1.00e+01 1.00e+01 -1.386 3.743 0.009
mh_providers 12 17028 1.00e-03 2.00e-03 1.00e-03 1.00e-03 1.00e-03 0.000 2.40e-02 2.40e-02 3.457 24.730 0.000
pop_provider_ratio 13 17028 2.00e+03 2.84e+03 9.90e+02 1.38e+03 8.97e+02 -957.000 5.49e+04 5.58e+04 4.291 37.897 21.797

The summary statistics show that we have nearly complete data for our most relevant quantitative variables – namely, the measures of mental health, income inequality, and median household income. Several other important economic variables, such as child_poverty, single_parent, and severe_housing, have nearly complete data too.

Our intended dependent variables, mental_health_days and mental_distress_rate, each have skewness and kurtosis below 1, which indicates those variables have relatively normal distributions. The mean and median for mentally unhealthy days in the past 30 days are similar at 4.04 and 4.017, with a standard deviation of 0.692. The mean mental_distress_rate is 0.126 with a standard deviation of 0.024. The median is slightly below the mean.

The economic variables, particularly inequality, have greater skewness and kurtosis, suggesting their distributions are less symmetrical and have larger tails. inequality has a mean of 4.524, a median of 4.413, and a standard deviation of 0.733.
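For readers who want to see what the skewness and kurtosis columns measure, here is a minimal base R sketch of the moment-based definitions. Our understanding is that these correspond to the `type = 1` option used in the summary tables, but that correspondence is an assumption; the function names and toy vectors are hypothetical.

```r
# Moment-based skewness: third central moment over variance^(3/2)
skew_type1 <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Moment-based excess kurtosis: fourth central moment over variance^2, minus 3
kurt_type1 <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

symmetric    <- c(1, 2, 3, 4, 5)    # toy values, symmetric around 3
right_skewed <- c(1, 1, 1, 2, 10)   # toy values with a long right tail

c(skew_sym = skew_type1(symmetric), skew_right = skew_type1(right_skewed))
c(kurt_sym = kurt_type1(symmetric), kurt_right = kurt_type1(right_skewed))
```

Skewness near zero indicates a symmetric distribution, and positive excess kurtosis indicates heavier tails than a normal distribution, which is the pattern described for inequality above.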

Scatterplots

For a first visual depiction of the relationship between mental health and income inequality, we graphed scatterplots of the county-level data.

Both poor mental health days and the frequent mental distress rate are positively correlated with income inequality. When coloring the scatterplot by median income, counties with a lower median income tend toward the outer limits of the figure, which suggests that poorer counties generally have higher inequality and worse mental health. It’s harder to infer a clear relationship when coloring the points by pct_rural, although many of the highly unequal counties are heavily urban.

Further, it might be important to get a sense of whether the availability of mental health care is related to measures of mental health. Our data contains the ratio of population to mental health providers, which could be thought of as representing “the number of individuals served by one mental health provider in a county, if the population were equally distributed across providers.” Surprisingly, there is not much of a correlation between the mental health outcomes and this ratio.
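The correlations discussed above and the fitted lines drawn on the scatterplots are closely related. The sketch below uses simulated data (not our dataset; the coefficients are arbitrary) to show that the slope of a simple linear fit equals the Pearson correlation rescaled by the two standard deviations.

```r
set.seed(1)

# Simulated stand-ins for the two variables (arbitrary, illustrative values)
inequality         <- runif(200, min = 3, max = 7)
mental_health_days <- 3 + 0.25 * inequality + rnorm(200, sd = 0.5)

# Pearson correlation between the two variables
r <- cor(inequality, mental_health_days)

# Slope of the least-squares line, as drawn by a linear smoother
fit   <- lm(mental_health_days ~ inequality)
slope <- unname(coef(fit)["inequality"])

# Identity for simple regression: slope = r * sd(y) / sd(x)
c(r = r, slope = slope,
  r_rescaled = r * sd(mental_health_days) / sd(inequality))
```

A nearly flat fitted line therefore goes hand in hand with a correlation near zero, which is what we see for the provider ratio.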

# data: use ranked
# colnames(ranked)

#-- Dependent vs Explanatory --#

# color = median income
color_value <- "Median Inc. ($T)"

c1 <- ggplot(ranked, aes(y=mental_health_days, x=inequality, color=median_inc/1000)) + 
  geom_point(show.legend = F) +
  geom_smooth(formula = y ~ x, method=lm, se=FALSE, color="#a89968") +  ## Add linear regression line
  labs(title = "(a) Per 30 Days",
       y = 'Poor Mental Health Days', x = 'Income Inequality Rate', 
       color = color_value)

c2 <- ggplot(ranked, aes(y=mental_distress_rate, x=inequality, color=median_inc/1000)) + 
  geom_point() +
  geom_smooth(formula = y ~ x, method=lm, se=FALSE, color="#a89968") + 
  labs(title = "(b) Percentage",
       y = 'Mental Distress Rate', x = 'Income Inequality Rate', 
       color = color_value)

# color = % rural
color_value <- "% Rural Pop."

c3 <- ggplot(ranked, aes(y=mental_health_days, x=inequality, color=pct_rural)) + 
  geom_point(show.legend = F) +
  geom_smooth(formula = y ~ x, method=lm, se=FALSE, color="#a89968") +  ## Add linear regression line
  labs(title = "(a) Per 30 Days",
       y = 'Poor Mental Health Days', x = 'Income Inequality Rate', 
       color = color_value)

c4 <- ggplot(ranked, aes(y=mental_distress_rate, x=inequality, color=pct_rural)) + 
  geom_point() +
  geom_smooth(formula = y ~ x, method=lm, se=FALSE, color="#a89968") + 
  labs(title = "(b) Percentage",
       y = 'Mental Distress Rate', x = 'Income Inequality Rate', 
       color = color_value)


#-- Dependent vs. Provision of Mental Health Care --#

# color = % rural
color_value <- "% Rural Pop."

c5 <- ggplot(ranked, aes(y=mental_health_days, x=pop_provider_ratio, color=pct_rural)) + 
  geom_point(show.legend = F) +
  geom_smooth(formula = y ~ x, method=lm, se=FALSE, color="#a89968") + 
  labs(title = "(a) Per 30 Days",
       y = 'Poor Mental Health Days per 30', x = 'Ratio Population : Providers', 
       color = color_value)

c6 <- ggplot(ranked, aes(y=mental_distress_rate, x=pop_provider_ratio, color=pct_rural)) + 
  geom_point() +
  geom_smooth(formula = y ~ x, method=lm, se=FALSE, color="#a89968") + 
  labs(title = "(b) Percentage",
       y = 'Mental Distress Rate', x = 'Ratio Population : Providers', 
       color = color_value)

# combine using patchwork 
c_a <- c1 + c2 + plot_annotation(title = "Scatterplot of Mental Health Variables vs. Income Inequality")
c_a

c_b <- c3 + c4 + plot_annotation(title = "Scatterplot of Mental Health Variables vs. Income Inequality")
c_b

c_c <- c5 + c6 + plot_annotation(title = "Scatterplot of Mental Health Variables vs. Availability of Mental Health Care")
c_c

We also plotted the mental health measures vs. inequality by region. The results are striking. The Midwest and the South both exhibit a moderate correlation between the variables. The West has a weaker, but still positive, correlation. In the Northeast, mental health and inequality do not seem to be correlated.

# data: use ranked_MW, ranked_NE, ranked_SO, ranked_WE

rgb_colors <- c("#A27BB8", "#006994", "#B52E1F", "#00873E")

#-- mental_health_days --#

p1 <- ggplot(ranked_MW, aes(y=mental_health_days, x=inequality)) + 
  geom_point(show.legend = F, alpha = .7, color = rgb_colors[1]) + 
  geom_smooth(formula = y ~ x, method=lm, se=FALSE, color="#a89968") + 
  labs(title = paste("(a) Midwest"),
       y = 'Days per 30', x = 'Income Inequality Rate')

p2 <- ggplot(ranked_NE, aes(y=mental_health_days, x=inequality)) + 
  geom_point(show.legend = F, alpha = .7, color = rgb_colors[2]) +
  geom_smooth(formula = y ~ x, method=lm, se=FALSE, color="#a89968") +
  labs(title = paste("(b) Northeast"),
       y = 'Days per 30', x = 'Income Inequality Rate')

p3 <- ggplot(ranked_SO, aes(y=mental_health_days, x=inequality)) + 
  geom_point(show.legend = F, alpha = .7, color = rgb_colors[3]) +
  geom_smooth(formula = y ~ x, method=lm, se=FALSE, color="#a89968") +
  labs(title = paste("(c) South"),
       y = 'Days per 30', x = 'Income Inequality Rate')

p4 <- ggplot(ranked_WE, aes(y=mental_health_days, x=inequality)) + 
  geom_point(show.legend = F, alpha = .7, color = rgb_colors[4]) +
  geom_smooth(formula = y ~ x, method=lm, se=FALSE, color="#a89968") +
  labs(title = paste("(d) West"),
       y = 'Days per 30', x = 'Income Inequality Rate')

p_regions_1 <- (p1 + p2)/(p3 + p4) + plot_annotation(title = "Scatterplot of Poor Mental Health Days vs. Income Inequality by Region")
p_regions_1

#-- mental_distress_rate --#

p1 <- ggplot(ranked_MW, aes(y=mental_distress_rate, x=inequality)) + 
  geom_point(show.legend = F, alpha = .7, color = rgb_colors[1]) + 
  geom_smooth(formula = y ~ x, method=lm, se=FALSE, color="#a89968") + 
  labs(title = paste("(a) Midwest"),
       y = '% Frequent Distress', x = 'Income Inequality Rate')

p2 <- ggplot(ranked_NE, aes(y=mental_distress_rate, x=inequality)) + 
  geom_point(show.legend = F, alpha = .7, color = rgb_colors[2]) +
  geom_smooth(formula = y ~ x, method=lm, se=FALSE, color="#a89968") +
  labs(title = paste("(b) Northeast"),
       y = '% Frequent Distress', x = 'Income Inequality Rate')

p3 <- ggplot(ranked_SO, aes(y=mental_distress_rate, x=inequality)) + 
  geom_point(show.legend = F, alpha = .7, color = rgb_colors[3]) +
  geom_smooth(formula = y ~ x, method=lm, se=FALSE, color="#a89968") +
  labs(title = paste("(c) South"),
       y = '% Frequent Distress', x = 'Income Inequality Rate')

p4 <- ggplot(ranked_WE, aes(y=mental_distress_rate, x=inequality)) + 
  geom_point(show.legend = F, alpha = .7, color = rgb_colors[4]) +
  geom_smooth(formula = y ~ x, method=lm, se=FALSE, color="#a89968") +
  labs(title = paste("(d) West"),
       y = '% Frequent Distress', x = 'Income Inequality Rate')

p_regions_2 <- (p1 + p2)/(p3 + p4) + plot_annotation(title = "Scatterplot of Frequent Mental Distress Rate vs. Income Inequality by Region")
p_regions_2

Notably, we compared both mental_health_days and mental_distress_rate against inequality and found similar trends.
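
The regional scatterplots can be complemented by a numeric correlation for each region. The following is only a minimal sketch under stated assumptions: it uses an invented toy data frame in place of ranked, and the slopes are hypothetical values chosen to mimic the pattern described above, not estimates from our data.

```r
# Toy stand-in for the `ranked` data; all values are invented for illustration.
set.seed(1)
n <- 50
toy <- data.frame(
  region = rep(c("Midwest", "Northeast", "South", "West"), each = n),
  inequality = runif(4 * n, min = 3, max = 7)
)

# Hypothetical slopes: a positive link everywhere except the Northeast.
slope <- c(Midwest = 0.5, Northeast = 0, South = 0.5, West = 0.25)
toy$mental_health_days <- 3 + slope[toy$region] * toy$inequality +
  rnorm(4 * n, sd = 0.5)

# Pearson correlation between the two variables within each region.
cor_by_region <- sapply(split(toy, toy$region), function(d) {
  cor(d$inequality, d$mental_health_days, use = "complete.obs")
})
print(round(cor_by_region, 2))
```

Applied to the real data, the same split-and-correlate pattern would give one coefficient per region to set beside each panel.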

At this point in the EDA, it became clear that geographic differences in the relationship between mental health and inequality were important. These results justified our decision to focus our SMART question on the differences by region. Our next step in exploration was to use boxplots to get a better sense of how the data varied across regional categories.

Boxplots

To better show the relationship between the economic variables and mental health in different regions of the United States, we generated boxplots of the data by region. Examining economic inequality and mental health region by region shows more clearly whether the relationship varies by geography than analyzing the United States as a whole. To highlight the key points, we focus on only two independent variables (inequality and median_inc) and two dependent variables (mental_health_days and mental_distress_rate). In total, we created 8 groups of boxplots.

In the first group, we compared median household income across the four regions for each year from 2016 to 2021. This is a cross-regional comparison within each year. The plots show that median household income is highest in the Northeast, followed by the West. This is not surprising: the Northeast and West contain large cities such as New York and Los Angeles, respectively, which may help explain their higher median household incomes. Median household income in the South and Midwest is relatively low.

# Characterization of the Median household income of four regions in the United States in 2016.
b1 <- ggplot(ranked16, aes(x=region_abb, y=median_inc)) + 
  geom_boxplot() + 
  geom_boxplot(colour="orange",fill="#7777cc",outlier.colour="red",outlier.shape=8, outlier.size=4) +
  labs(title="2016", y="", x="")

# Characterization of the Median household income of four regions in the United States in 2017.
b2 <-ggplot(ranked17, aes(x=region_abb, y=median_inc)) + 
  geom_boxplot() + 
  geom_boxplot(colour="orange",fill="#7777cc",outlier.colour="red",outlier.shape=8, outlier.size=4) +
  labs(title="2017", y="", x="")

# Characterization of the Median household income of four regions in the United States in 2018.
b3 <-ggplot(ranked18, aes(x=region_abb, y=median_inc)) + 
  geom_boxplot() + 
  geom_boxplot(colour="orange",fill="#7777cc",outlier.colour="red",outlier.shape=8, outlier.size=4) +
  labs(title="2018", y="", x="")

# Characterization of the Median household income of four regions in the United States in 2019.
b4 <-ggplot(ranked19, aes(x=region_abb, y=median_inc)) + 
  geom_boxplot() + 
  geom_boxplot(colour="orange",fill="#7777cc",outlier.colour="red",outlier.shape=8, outlier.size=4) +
  labs(title="2019", y="", x="")

# Characterization of the Median household income of four regions in the United States in 2020.
b5 <-ggplot(ranked20, aes(x=region_abb, y=median_inc)) + 
  geom_boxplot() + 
  geom_boxplot(colour="orange",fill="#7777cc",outlier.colour="red",outlier.shape=8, outlier.size=4) +
  labs(title="2020", y="", x="")

# Characterization of the Median household income of four regions in the United States in 2021.
b6 <-ggplot(ranked21, aes(x=region_abb, y=median_inc)) + 
  geom_boxplot() + 
  geom_boxplot(colour="orange",fill="#7777cc",outlier.colour="red",outlier.shape=8, outlier.size=4) +
  labs(title="2021", y="", x="")

boxplot1 <- grid.arrange(b1,b2,b3,b4,b5,b6, nrow=2, ncol=3, top = text_grob("The Characterization of Median Income by Region per Year", color = "black", face = "bold", size = 14))

In the second group, we analyzed median household income for each region separately across the six years, rather than comparing regions within a single year as in Boxplot1. These plots show that median household income trended gradually upward in all 4 regions from 2016 to 2021. For example, in the South, median household income was about $40,000 in 2016 and close to $50,000 by 2021. This suggests that household incomes rose across the U.S. in the six years between 2016 and 2021.

# Characterization of the Median household income of northeast region from 2016 to 2021.
b1 <- ggplot(ranked_NE, aes(x=factor(year), y=median_inc)) +
  geom_boxplot() +
  geom_boxplot(colour="orange",fill="#7777cc",outlier.colour="red",outlier.shape=8, outlier.size=4) +
  labs(title="Northeast", x="year")

# Characterization of the Median household income of south region from 2016 to 2021.
b2 <- ggplot(ranked_SO, aes(x=factor(year), y=median_inc)) +
  geom_boxplot() +
  geom_boxplot(colour="orange",fill="#7777cc",outlier.colour="red",outlier.shape=8, outlier.size=4) +
  labs(title="South", x="year")

# Characterization of the Median household income of west region from 2016 to 2021.
b3 <- ggplot(ranked_WE, aes(x=factor(year), y=median_inc)) +
  geom_boxplot() +
  geom_boxplot(colour="orange",fill="#7777cc",outlier.colour="red",outlier.shape=8, outlier.size=4) +
  labs(title="West", x="year")

# Characterization of the Median household income of midwest region from 2016 to 2021.
b4 <- ggplot(ranked_MW, aes(x=factor(year), y=median_inc)) +
  geom_boxplot() +
  geom_boxplot(colour="orange",fill="#7777cc",outlier.colour="red",outlier.shape=8, outlier.size=4) +
  labs(title="Midwest", x="year")

boxplot2 <- grid.arrange(b1,b2,b3,b4, nrow=2, ncol=2, top = text_grob("Median Income by Region from 2016 to 2021", color = "black", face = "bold", size = 14))

In the third group, the boxplots show income inequality in the 4 regions for each year from 2016 to 2021. The patterns in these 6 plots are similar: income inequality is highest in the South. This is consistent with the six plots above for median household income. The South, which has the lowest median household income, also appears to have the greatest income inequality.

# Characterization of Income inequality of four regions in the United States in 2016.
b1 <- ggplot(ranked16, aes(x=region_abb, y=inequality)) + 
  geom_boxplot() + 
  geom_boxplot(colour="orange",fill="#7777cc",outlier.colour="red",outlier.shape=8, outlier.size=4) +
  labs(title="2016", x="region")

# Characterization of Income inequality of four regions in the United States in 2017.
b2 <- ggplot(ranked17, aes(x=region_abb, y=inequality)) + 
  geom_boxplot() + 
  geom_boxplot(colour="orange",fill="#7777cc",outlier.colour="red",outlier.shape=8, outlier.size=4) +
  labs(title="2017", x="region")

# Characterization of Income inequality of four regions in the United States in 2018.
b3 <- ggplot(ranked18, aes(x=region_abb, y=inequality)) + 
  geom_boxplot() + 
  geom_boxplot(colour="orange",fill="#7777cc",outlier.colour="red",outlier.shape=8, outlier.size=4) +
  labs(title="2018", x="region")

# Characterization of Income inequality of four regions in the United States in 2019.
b4 <- ggplot(ranked19, aes(x=region_abb, y=inequality)) + 
  geom_boxplot() + 
  geom_boxplot(colour="orange",fill="#7777cc",outlier.colour="red",outlier.shape=8, outlier.size=4) +
  labs(title="2019", x="region")

# Characterization of Income inequality in four regions of the United States in 2020.
b5 <- ggplot(ranked20, aes(x=region_abb, y=inequality)) + 
  geom_boxplot() + 
  geom_boxplot(colour="orange",fill="#7777cc",outlier.colour="red",outlier.shape=8, outlier.size=4) +
  labs(title="2020", x="region")

# Characterization of Income inequality of four regions in the United States in 2021.
b6 <- ggplot(ranked21, aes(x=region_abb, y=inequality)) + 
  geom_boxplot() + 
  geom_boxplot(colour="orange",fill="#7777cc",outlier.colour="red",outlier.shape=8, outlier.size=4) +
  labs(title="2021", x="region")

boxplot3 <- grid.arrange(b1,b2,b3,b4,b5,b6, nrow=2, ncol=3, top = text_grob("The Characterization of Income Inequality by Region per Year", color = "black", face = "bold", size = 14))

In the fourth group, we analyzed income inequality for each of the four regions separately over the 6 years. These four plots show that, regardless of region, income inequality has remained fairly stable, at around 4.4% for 6 consecutive years.

# Characterization of the income inequality of northeast region from 2016 to 2021.
b1 <- ggplot(ranked_NE, aes(x=factor(year), y=inequality)) +
  geom_boxplot() +
  geom_boxplot(colour="orange",fill="#7777cc",outlier.colour="red",outlier.shape=8, outlier.size=4) +
  labs(title="Northeast", x="year")

# Characterization of the income inequality of south region from 2016 to 2021.
b2 <- ggplot(ranked_SO, aes(x=factor(year), y=inequality)) +
  geom_boxplot() +
  geom_boxplot(colour="orange",fill="#7777cc",outlier.colour="red",outlier.shape=8, outlier.size=4) +
  labs(title="South", x="year")

# Characterization of the income inequality of west region from 2016 to 2021.
b3 <- ggplot(ranked_WE, aes(x=factor(year), y=inequality)) +
  geom_boxplot() +
  geom_boxplot(colour="orange",fill="#7777cc",outlier.colour="red",outlier.shape=8, outlier.size=4) +
  labs(title="West", x="year")

# Characterization of the income inequality of midwest region from 2016 to 2021.
b4 <- ggplot(ranked_MW, aes(x=factor(year), y=inequality)) +
  geom_boxplot() +
  geom_boxplot(colour="orange",fill="#7777cc",outlier.colour="red",outlier.shape=8, outlier.size=4) +
  labs(title="Midwest", x="year")

boxplot4 <- grid.arrange(b1,b2,b3,b4, nrow=2, ncol=2, top = text_grob("Income Inequality by Region from 2016 to 2021", color = "black", face = "bold", size = 14))

In the fifth group of boxplots, we begin to analyze the dependent variables. In Boxplot5, we compare poor mental health days across the four regions for each year from 2016 to 2021. The plots show that the South has the highest number of poor mental health days. This is consistent with our findings from Boxplot1, which showed that the South has the lowest median household income of the 4 regions. One possible explanation is that lower household incomes leave residents of the South more prone to mental health problems, although the boxplots alone cannot establish this.

# Characterization of the Poor mental health days/month of four regions in the United States in 2016.
b7 <- ggplot(ranked16, aes(x=region_abb, y=mental_health_days)) + 
  geom_boxplot() + 
  geom_boxplot(colour="orange",fill="#7777cc",outlier.colour="red",outlier.shape=8, outlier.size=4) +
  labs(title="2016", y="", x="")

# Characterization of the Poor mental health days/month of four regions in the United States in 2017.
b8 <- ggplot(ranked17, aes(x=region_abb, y=mental_health_days)) + 
  geom_boxplot() + 
  geom_boxplot(colour="orange",fill="#7777cc",outlier.colour="red",outlier.shape=8, outlier.size=4) +
  labs(title="2017", y="", x="")

# Characterization of the Poor mental health days/month of four regions in the United States in 2018.
b9 <- ggplot(ranked18, aes(x=region_abb, y=mental_health_days)) + 
  geom_boxplot() + 
  geom_boxplot(colour="orange",fill="#7777cc",outlier.colour="red",outlier.shape=8, outlier.size=4) +
  labs(title="2018", y="", x="")

# Characterization of the Poor mental health days/month of four regions in the United States in 2019.
b10 <- ggplot(ranked19, aes(x=region_abb, y=mental_health_days)) + 
  geom_boxplot() + 
  geom_boxplot(colour="orange",fill="#7777cc",outlier.colour="red",outlier.shape=8, outlier.size=4) +
  labs(title="2019", y="", x="")

# Characterization of the Poor mental health days/month of four regions in the United States in 2020.
b11<-ggplot(ranked20, aes(x=region_abb, y=mental_health_days)) + 
  geom_boxplot() + 
  geom_boxplot(colour="orange",fill="#7777cc",outlier.colour="red",outlier.shape=8, outlier.size=4) +
  labs(title="2020", y="", x="")

# Characterization of the Poor mental health days/month of four regions in the United States in 2021.
b12<-ggplot(ranked21, aes(x=region_abb, y=mental_health_days)) + 
  geom_boxplot() + 
  geom_boxplot(colour="orange",fill="#7777cc",outlier.colour="red",outlier.shape=8, outlier.size=4) +
  labs(title="2021", y="", x="")

boxplot5 <- grid.arrange(b7,b8,b9,b10,b11,b12, nrow=2, ncol=3, top = text_grob("The Characterization of Mentally Unhealthy Days/Month by Region per Year", color = "black", face = "bold", size = 14))

In the sixth group, we analyzed poor mental health days for each of the 4 regions separately over the 6 years. The second group of boxplots indicated that median household income is rising in every region. Nevertheless, poor mental health days are also increasing year by year. One possible explanation is that income growth may not be keeping up with local prices, which could exacerbate mental health problems.

# Characterization of the Poor mental health days/month of northeast region from 2016 to 2021.
b1 <- ggplot(ranked_NE, aes(x=factor(year), y=mental_health_days)) + 
  geom_boxplot() + 
  geom_boxplot(colour="orange",fill="#7777cc",outlier.colour="red",outlier.shape=8, outlier.size=4) +
  labs(title="Northeast", x="year")

# Characterization of the Poor mental health days/month of south region from 2016 to 2021.
b2 <- ggplot(ranked_SO, aes(x=factor(year), y=mental_health_days)) + 
  geom_boxplot() + 
  geom_boxplot(colour="orange",fill="#7777cc",outlier.colour="red",outlier.shape=8, outlier.size=4) +
  labs(title="South", x="year")

# Characterization of the Poor mental health days/month of west region from 2016 to 2021.
b3 <- ggplot(ranked_WE, aes(x=factor(year), y=mental_health_days)) + 
  geom_boxplot() + 
  geom_boxplot(colour="orange",fill="#7777cc",outlier.colour="red",outlier.shape=8, outlier.size=4) +
  labs(title="West", x="year")

# Characterization of the Poor mental health days/month of midwest region from 2016 to 2021.
b4 <- ggplot(ranked_MW, aes(x=factor(year), y=mental_health_days)) + 
  geom_boxplot() + 
  geom_boxplot(colour="orange",fill="#7777cc",outlier.colour="red",outlier.shape=8, outlier.size=4) +
  labs(title="Midwest", x="year")

boxplot6 <- grid.arrange(b1,b2,b3,b4, nrow=2, ncol=2, top = text_grob("Poor Mental Health Days by Region from 2016 to 2021", color = "black", face = "bold", size = 14))

In the seventh group, we analyzed the frequent mental distress rate of these four regions over the past 6 years. The results are similar to those for poor mental health days.

# Characterization of the Frequent mental distress rate of four regions in the United States in 2016.
b13<-ggplot(ranked16, aes(x=region_abb, y=mental_distress_rate)) + 
  geom_boxplot() + 
  geom_boxplot(colour="orange",fill="#7777cc",outlier.colour="red",outlier.shape=8, outlier.size=4) +
  labs(title="2016", x="region")

# Characterization of the Frequent mental distress rate of four regions in the United States in 2017.
b14<-ggplot(ranked17, aes(x=region_abb, y=mental_distress_rate)) + 
  geom_boxplot() + 
  geom_boxplot(colour="orange",fill="#7777cc",outlier.colour="red",outlier.shape=8, outlier.size=4) +
  labs(title="2017", x="region")

# Characterization of the Frequent mental distress rate of four regions in the United States in 2018.
b15<-ggplot(ranked18, aes(x=region_abb, y=mental_distress_rate)) + 
  geom_boxplot() + 
  geom_boxplot(colour="orange",fill="#7777cc",outlier.colour="red",outlier.shape=8, outlier.size=4) +
  labs(title="2018", x="region")

# Characterization of the Frequent mental distress rate of four regions in the United States in 2019.
b16<-ggplot(ranked19, aes(x=region_abb, y=mental_distress_rate)) + 
  geom_boxplot() + 
  geom_boxplot(colour="orange",fill="#7777cc",outlier.colour="red",outlier.shape=8, outlier.size=4) +
  labs(title="2019", x="region")

# Characterization of the Frequent mental distress rate of four regions in the United States in 2020.
b17<-ggplot(ranked20, aes(x=region_abb, y=mental_distress_rate)) + 
  geom_boxplot() + 
  geom_boxplot(colour="orange",fill="#7777cc",outlier.colour="red",outlier.shape=8, outlier.size=4) +
  labs(title="2020", x="region")

# Characterization of the Frequent mental distress rate of four regions in the United States in 2021.
b18<-ggplot(ranked21, aes(x=region_abb, y=mental_distress_rate)) + 
  geom_boxplot() + 
  geom_boxplot(colour="orange",fill="#7777cc",outlier.colour="red",outlier.shape=8, outlier.size=4) +
  labs(title="2021", x="region")

boxplot7 <- grid.arrange(b13,b14,b15,b16,b17,b18, nrow=2, ncol=3, top = text_grob("The Characterization of Frequent Mental Distress Rate by Region per Year", color = "black", face = "bold", size = 14))

In the eighth group, we analyzed the frequent mental distress rate for each of the four regions separately over the 6 years. The frequent mental distress rate gradually rises in every region. Combined with the prior results, this suggests that mental distress is increasing even though local incomes are rising year by year.

# Characterization of the Frequent mental distress rate of northeast region from 2016 to 2021.
b1 <- ggplot(ranked_NE, aes(x=factor(year), y=mental_distress_rate)) + 
  geom_boxplot() + 
  geom_boxplot(colour="orange",fill="#7777cc",outlier.colour="red",outlier.shape=8, outlier.size=4) +
  labs(title="Frequent mental distress rate of northeast region from 2016 to 2021", x="year")

# Characterization of the Frequent mental distress rate of south region from 2016 to 2021.
b2 <- ggplot(ranked_SO, aes(x=factor(year), y=mental_distress_rate)) + 
  geom_boxplot() + 
  geom_boxplot(colour="orange",fill="#7777cc",outlier.colour="red",outlier.shape=8, outlier.size=4) +
  labs(title="Frequent mental distress rate of south region from 2016 to 2021", x="year")

# Characterization of the Frequent mental distress rate of west region from 2016 to 2021.
b3 <- ggplot(ranked_WE, aes(x=factor(year), y=mental_distress_rate)) + 
  geom_boxplot() + 
  geom_boxplot(colour="orange",fill="#7777cc",outlier.colour="red",outlier.shape=8, outlier.size=4) +
  labs(title="Frequent mental distress rate of west region from 2016 to 2021", x="year")

# Characterization of the Frequent mental distress rate of midwest region from 2016 to 2021.
b4 <- ggplot(ranked_MW, aes(x=factor(year), y=mental_distress_rate)) + 
  geom_boxplot() + 
  geom_boxplot(colour="orange",fill="#7777cc",outlier.colour="red",outlier.shape=8, outlier.size=4) +
  labs(title="Frequent mental distress rate of midwest region from 2016 to 2021", x="year")

boxplot8 <- grid.arrange(b1,b2,b3,b4, nrow=2, ncol=2, top = text_grob("Frequent Mental Distress Rate by Region from 2016 to 2021", color = "black", face = "bold", size = 14))

Overall, what can we conclude from these boxplots?

  1. Among the four U.S. regions, the Northeast has the highest median household income and the South has the lowest.
  2. Income inequality is highest in the South.
  3. Residents in the South also have the highest number of mentally unhealthy days per month, which coincides with their relatively low household income and higher levels of inequality.
  4. Household incomes have been rising across the U.S. for the past six years, but income inequality has not changed much.
  5. In addition, even though household incomes are increasing across the United States, residents’ mental health problems are still growing.

As a whole, there seem to be real differences across regions in these key variables. The boxplots also indicate that these variables have outliers. As a result, we should look into the normality of the data before hypothesis testing.
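
The regional differences and outliers seen in the boxplots can also be summarized numerically. This is only a sketch under stated assumptions: the toy data frame and its values are invented, and base R's boxplot.stats uses hinges, which may differ slightly from the quartiles that ggplot2 draws.

```r
# Toy stand-in for `ranked`; values are invented for illustration only.
toy <- data.frame(
  region_abb = rep(c("MW", "NE", "SO", "WE"), each = 6),
  median_inc = c(48, 50, 51, 52, 53, 90,
                 58, 60, 61, 62, 63, 64,
                 40, 42, 43, 44, 45, 46,
                 52, 54, 55, 56, 57, 58) * 1000
)

# Median of the variable within each region.
region_medians <- tapply(toy$median_inc, toy$region_abb, median)

# Number of points flagged as outliers (beyond 1.5 * IQR from the hinges).
outlier_counts <- tapply(toy$median_inc, toy$region_abb,
                         function(x) length(boxplot.stats(x)$out))

print(region_medians)
print(outlier_counts)
```

A table like this makes it easier to see which regions drive the outliers before deciding how to handle them.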

Normality

We also graphed histograms of many of the numeric variables to consider their distributions. Although none are perfectly “normal” distributions, many do seem to approximate normality. Most importantly, mental_distress_rate and mental_health_days have distributions that resemble normality. Generally, the economic variables – such as inequality, median_inc, and unempl – are right-skewed.

# variables to plot
keep_var <- c('inequality', 'unempl', 'child_poverty', 'hs_grad', 'college', 'single_parent', 'severe_housing', 'food_index', 'mental_health_days', 'mental_distress_rate', 'median_inc', 'pop_provider_ratio')

# plot histograms
ranked[keep_var] %>% 
  gather() %>% 
  ggplot(aes(x=value)) + 
    facet_wrap(~ key, scales = "free") + 
    geom_histogram(bins=50, color = "#033C5A") +
  labs(title = "Histograms for Select Numeric Variables")

We also assessed the normality of mental_health_days, mental_distress_rate, and inequality by region. In general, the distributions are relatively consistent across each region.

# Midwest
mw1 <- ggplot(ranked_MW, aes(x=mental_health_days)) + geom_histogram(bins = 50) + ggtitle("Midwest")

mw2 <- ggplot(ranked_MW, aes(x=mental_distress_rate)) + geom_histogram(bins = 50) + ggtitle("")

mw3 <- ggplot(ranked_MW, aes(x=inequality)) + geom_histogram(bins = 50) + ggtitle("")

# West
w1 <- ggplot(ranked_WE, aes(x=mental_health_days)) + geom_histogram(bins = 50) + ggtitle("West")

w2 <- ggplot(ranked_WE, aes(x=mental_distress_rate)) + geom_histogram(bins = 50) + ggtitle("")

w3 <- ggplot(ranked_WE, aes(x=inequality)) + geom_histogram(bins = 50) + ggtitle("")

# Northeast
ne1 <- ggplot(ranked_NE, aes(x=mental_health_days)) + geom_histogram(bins = 50) + ggtitle("Northeast")

ne2 <- ggplot(ranked_NE, aes(x=mental_distress_rate)) + geom_histogram(bins = 50) + ggtitle("")

ne3 <- ggplot(ranked_NE, aes(x=inequality)) + geom_histogram(bins = 50) + ggtitle("")

# South
s1 <- ggplot(ranked_SO, aes(x=mental_health_days)) + geom_histogram(bins = 50) + ggtitle("South")

s2 <- ggplot(ranked_SO, aes(x=mental_distress_rate)) + geom_histogram(bins = 50) + ggtitle("")

s3 <- ggplot(ranked_SO, aes(x=inequality)) + geom_histogram(bins = 50) + ggtitle("")

hist_regions_a <- grid.arrange(mw1,mw2,mw3,ne1,ne2,ne3, nrow=2, ncol=3, top = text_grob("Regional Histograms (a)", color = "black", face = "bold", size = 14))

hist_regions_b <- grid.arrange(s1,s2,s3,w1,w2,w3, nrow=2, ncol=3, top = text_grob("Regional Histograms (b)", color = "black", face = "bold", size = 14))

To complement the histograms, we also graphed Q-Q plots to assess the normality of our data. These plots compare the theoretical and sample distributions for a variable at specific quantiles. The line on each graph estimates what the variable would look like with a normal distribution.

In general, the mental health variables do not diverge too far from normality. Also, inequality and median_inc tightly follow a normal distribution until the upper range of their values, reflecting the greater spread that exists at the high end of the distribution.

# plot Q-Q plots
ranked[keep_var] %>% 
  gather() %>% 
  ggplot(aes(sample=value)) + 
    facet_wrap(~ key, scales = "free") + 
    # na.rm belongs in the stat layers; ggplot() itself does not use it
    stat_qq(color = "#033C5A", na.rm = TRUE) + stat_qq_line(na.rm = TRUE) + labs(title = "Q-Q Plots for Select Numeric Variables", y = "Sample Quantiles", x = "Theoretical Quantiles")

Despite not having a perfectly normal distribution, the sample quantiles remain relatively close to the theoretical quantiles for many of our variables. In other words, each Q-Q plot is not that different from a normal Q-Q plot, which would be represented by its points closely following the Q-Q line.
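
One way to put a number on how closely a Q-Q plot follows its line is the correlation between the sample and theoretical quantiles. The sketch below is illustrative only: the two vectors are simulated rather than taken from our data, and qq_correlation is a hypothetical helper name introduced here.

```r
set.seed(42)
normal_like  <- rnorm(500, mean = 4, sd = 0.6)  # invented, roughly symmetric
right_skewed <- rexp(500, rate = 1)             # invented, long right tail

# Correlation between sample quantiles and theoretical normal quantiles.
# Values close to 1 mean the points lie close to a straight line.
qq_correlation <- function(x) {
  x <- x[!is.na(x)]
  qq <- qqnorm(x, plot.it = FALSE)  # qq$x: theoretical, qq$y: sample quantiles
  cor(qq$x, qq$y)
}

print(qq_correlation(normal_like))
print(qq_correlation(right_skewed))
```

The skewed vector should score visibly lower, which mirrors the upper-tail departures described for the economic variables.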

Ultimately, even if none of our variables are perfectly “normal,” few would seem to pose substantial issues during hypothesis testing. inequality and median_inc have right-tailed distributions, but without very thick tails. Several more extreme exceptions might include pop_provider_ratio, severe_housing, unempl, and food_index. We should consider removing outliers or transforming these variables before using them in linear modeling.

If we decide to remove any outliers using the ezids package, we should do it here.

If we decide to transform any of the data (e.g., log transformation), we should do it here.
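
As a rough illustration of what a log transformation could do for a right-skewed variable such as pop_provider_ratio, the sketch below compares sample skewness before and after taking logs. The data are simulated, and skewness is a small helper defined here, not a function from a package used elsewhere in this paper.

```r
# Sample skewness (third standardized moment); 0 indicates symmetry.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

set.seed(7)
ratio <- rlnorm(1000, meanlog = 6, sdlog = 0.8)  # invented right-skewed values

print(skewness(ratio))       # clearly positive: long right tail
print(skewness(log(ratio)))  # much closer to 0 after the log transformation
```

A log transformation only applies to strictly positive values, so any zeros would need to be handled first.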

Correlation Matrix

We also assessed the correlation among the numeric variables. This helps us establish whether variables are positively or negatively associated with each other, as well as the strength of that relationship.

In general, the results were not surprising. The two mental health variables are strongly positively correlated, and both are also positively correlated with inequality. In addition, inequality exhibited the strongest positive association with child_poverty and single_parent, and it had the strongest negative correlation with food_index and median_inc. These results make sense (remember that a higher score on the food index indicates a better food environment).

# Correlation Matrices for numeric data

ranked_numeric <- subset(ranked, select = c("mental_health_days", "mental_distress_rate", "inequality", "median_inc", "hs_grad", "college", "unempl", "child_poverty","single_parent", "severe_housing", "food_index","mh_providers","pop_provider_ratio"))

a <- as.matrix(ranked_numeric)

b <- cor(a, use = "na.or.complete")

#corr_numbers <- corrplot(b, is.corr=TRUE, method="number", title="Correlation Matrix for Numeric Vars.",mar=c(0,0,1,0))

corr_numbers <- corrplot(b, is.corr=TRUE, title="Correlation Matrix for Numeric Vars.",mar=c(0,0,1,0))
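
To read off which variables are most strongly associated with inequality, the relevant row of the correlation matrix can be sorted. This is a minimal sketch on simulated data: the column names echo ours, but the relationships are invented for illustration.

```r
set.seed(3)
n <- 200
inequality <- rnorm(n)
toy <- data.frame(
  inequality    = inequality,
  child_poverty = 0.8 * inequality + rnorm(n, sd = 0.5),   # invented positive link
  median_inc    = -0.6 * inequality + rnorm(n, sd = 0.8),  # invented negative link
  hs_grad       = rnorm(n)                                 # invented, unrelated
)
toy$hs_grad[1:5] <- NA  # a few missing values

# Same missing-value handling as in the correlation matrix code above.
cor_matrix <- cor(as.matrix(toy), use = "na.or.complete")

# Correlations with inequality, sorted from most negative to most positive.
keep <- colnames(cor_matrix) != "inequality"
with_inequality <- sort(cor_matrix["inequality", keep])
print(round(with_inequality, 2))
```

The two ends of the sorted vector correspond to the strongest negative and strongest positive associations.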

EDA Summary

Based on our EDA, we are able to draw preliminary conclusions concerning our questions. The scatterplots reveal a positive relationship between inequality and the mental health variables. More specifically, counties with greater inequality generally have worse mental health, according to the data. However, the relationship seems to differ by region. The Midwest, South, and West exhibit positive correlations, while the association in the Northeast is flat.

The boxplots clarify these regional differences. The South generally has the lowest median incomes and the highest inequality, while also having the worst mental health. Conversely, the Midwest has the lowest inequality, moderate median incomes, and also the best mental health outcomes. The Northeast, which has the highest median household incomes, seems to also have the smallest variation in income inequality and mental health outcomes.

EDA was also helpful to refine our questions as we became increasingly familiar with the data. Our initial hypothesis focused on the link between economic inequality and mental health. Reviewing the literature prompted us to think about whether this relationship differed by geography. Visually contrasting these variables by region confirmed that such differences may exist and pushed us to continue investigating whether the association would hold up in a more formal context. Thus, to get clearer answers to our questions, we estimated mean values by region and conducted hypothesis tests across regional samples.

Estimation and Hypothesis Testing

We conducted several types of analysis to test whether there is a statistically significant relationship between mental health and inequality that differs by region.

T-Intervals

First, we estimated the t-intervals by region for three variables at the 99% confidence level: inequality, mental_health_days, and mental_distress_rate. This helped us get a better sense of how much the mean values differ across regions and whether their confidence intervals overlap. Given the large size of our data set, we expect robust results even if our variables have skewed distributions.

# Estimation: Establish confidence intervals
# t-interval by region for inequality, mental_health_days, mental_distress_rate

# estimate for all regions
ttest99_in <- t.test(x=ranked$inequality, conf.level=0.99)
ttest99_mh_days <- t.test(x=ranked$mental_health_days, conf.level=0.99)
ttest99_mh_rate <- t.test(x=ranked$mental_distress_rate, conf.level=0.99)

ci99_in <- paste0("[", paste0(round(ttest99_in$conf.int, 4), collapse = ", "), "]")

ci99_mh_days <- paste0("[", paste0(round(ttest99_mh_days$conf.int, 4), collapse = ", "), "]")

ci99_mh_rate <- paste0("[", paste0(round(ttest99_mh_rate$conf.int, 4), collapse = ", "), "]")

var_list <- c("inequality", "mental_health_days", "mental_distress_rate")

ci_list <- list(rep(c("All"), times = length(var_list)), var_list, c(ci99_in, ci99_mh_days, ci99_mh_rate))


# estimate by region
region_list <- c("Midwest", "Northeast", "South", "West")

for (r in region_list) {  ## region names
  
  ranked_region <- ranked %>% subset(region==r)
  
  ttest99_in <- t.test(ranked_region$inequality, conf.level=0.99)
  ttest99_mh_days <- t.test(ranked_region$mental_health_days, conf.level=0.99)
  ttest99_mh_rate <- t.test(ranked_region$mental_distress_rate, conf.level=0.99)
  
  ci99_in <- paste0("[", paste0(round(ttest99_in$conf.int, 4), collapse = ", "), "]")
  
  ci99_mh_days <- paste0("[", paste0(round(ttest99_mh_days$conf.int, 4), collapse = ", "), "]")
  
  ci99_mh_rate <- paste0("[", paste0(round(ttest99_mh_rate$conf.int, 4), collapse = ", "), "]")
  
  ci_list[[1]] <- append(ci_list[[1]], rep(c(r), times = length(var_list)))
  ci_list[[2]] <- append(ci_list[[2]], var_list)
  ci_list[[3]] <- append(ci_list[[3]], c(ci99_in, ci99_mh_days, ci99_mh_rate))
  
  #print(ci_list)
}

# print results
for (n in seq_along(ci_list[[1]])) {  ## one row per region-variable pair
  
  print(paste(ci_list[[1]][n], ci_list[[2]][n], ci_list[[3]][n], sep=" | "))
  
}

The t-intervals by region are reported for each variable of interest in the following table:


Region Variable 99% conf.int
All inequality [4.5097, 4.5375]
All mental_health_days [4.0266, 4.0529]
All mental_distress_rate [0.1256, 0.1265]
Midwest inequality [4.1641, 4.2011]
Midwest mental_health_days [3.686, 3.7301]
Midwest mental_distress_rate [0.115, 0.1165]
Northeast inequality [4.4926, 4.5863]
Northeast mental_health_days [3.9758, 4.041]
Northeast mental_distress_rate [0.1193, 0.1215]
South inequality [4.7855, 4.8277]
South mental_health_days [4.315, 4.3513]
South mental_distress_rate [0.1355, 0.1368]
West inequality [4.3659, 4.4358]
West mental_health_days [3.8556, 3.9113]
West mental_distress_rate [0.1196, 0.1216]

Based on visual inspection, the mean values of the variables differ by region: the 99% intervals do not overlap for any pair of regions, except for mental_distress_rate in the Northeast and the West. This is consistent with our earlier exploratory data analysis using boxplots. Hypothesis tests comparing the mean values of inequality, mental_health_days, and mental_distress_rate across regions give us greater clarity. Because there are four regions, a single two-sample t-test cannot compare them all at once. Instead, we conducted an ANOVA test for each variable, treating the regions as the samples.
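The visual comparison of intervals can also be done mechanically. The sketch below takes the 99% intervals reported for mental_distress_rate in the table above and flags which pairs of regions overlap. The helper name overlaps is illustrative. Non-overlapping intervals hint at a difference in means but are not a formal test, which is why the ANOVA tests follow.

```r
# 99% intervals for mental_distress_rate, copied from the table above
ci <- list(
  Midwest   = c(0.1150, 0.1165),
  Northeast = c(0.1193, 0.1215),
  South     = c(0.1355, 0.1368),
  West      = c(0.1196, 0.1216)
)

# Two closed intervals overlap when each one starts before the other ends
overlaps <- function(a, b) a[1] <= b[2] && b[1] <= a[2]

# Check every pair of regions
region_pairs  <- combn(names(ci), 2)
overlap_flags <- apply(region_pairs, 2,
                       function(p) overlaps(ci[[p[1]]], ci[[p[2]]]))
names(overlap_flags) <- apply(region_pairs, 2, paste, collapse = "-")

print(overlap_flags)
```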

ANOVA Tests

ANOVA assumes that the observations within each group are approximately normally distributed and that the groups have similar variances. We examined the distributions of inequality, mental_health_days, and mental_distress_rate during EDA and found that, in general, they are relatively consistent across regions and that the variances do not seem to differ substantially.
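We relied on EDA to judge these assumptions. A more formal check might look like the sketch below, which runs on simulated data standing in for ranked. The data frame demo, its column names, and the chosen means and standard deviations are all hypothetical. It compares group standard deviations, runs Bartlett's test of equal variances, and fits Welch's one-way test, which does not assume equal variances and could serve as a sensitivity check.

```r
# Hypothetical data standing in for `ranked` (four regions, similar spread)
set.seed(1)
demo <- data.frame(
  region = factor(rep(c("Midwest", "Northeast", "South", "West"), each = 200)),
  value  = c(rnorm(200, 4.2, 0.65), rnorm(200, 4.5, 0.70),
             rnorm(200, 4.8, 0.68), rnorm(200, 4.4, 0.66))
)

# Similar standard deviations across groups support the equal-variance assumption
sd_by_region <- tapply(demo$value, demo$region, sd)
print(round(sd_by_region, 3))

# Bartlett's test of equal variances (note: sensitive to non-normality)
bart <- bartlett.test(value ~ region, data = demo)
print(bart$p.value)

# Welch's one-way test does not assume equal variances
welch <- oneway.test(value ~ region, data = demo, var.equal = FALSE)
print(welch$p.value)
```

If the Welch test and the classical ANOVA lead to the same conclusion, moderate differences in variance are unlikely to be driving the result.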

# anova test for inequality
aov_in1 <- aov(inequality ~ region, data = ranked)

summary_in <- summary(aov_in1)

#xkabledply(aov_in, title = "ANOVA results: Income inequality by Region")

# tidy() returns term, df, sumsq, meansq, statistic, p.value, so label the columns accordingly
aov_in <- tidy(aov_in1)
aov_in %>%
  kbl(caption="ANOVA Results: Income Inequality by Region",
       format= "html", col.names = c("Term", "Df", "Sum Sq", "Mean Sq", "F", "Pr(>F)"),
      align="r") %>%
   kable_classic_2(full_width = F, html_font = "helvetica")

ANOVA Results: Income Inequality by Region
Term Df Sum Sq Mean Sq F Pr(>F)
region 3 1434 477.94 1039 0
Residuals 18461 8492 0.46 NA NA

tukey_in <- TukeyHSD(aov_in1)
#tukey_in

tukey_in$region %>%
  kbl(caption="Tukey HSD Test: Income Inequality by Region",
       format= "html", col.names = c("Diff", "Lower", "Upper", "p adj"),
      align="r") %>%
   kable_classic_2(full_width = F, html_font = "helvetica")

Tukey HSD Test: Income Inequality by Region
Diff Lower Upper p adj
Northeast-Midwest 0.357 0.304 0.410 0
South-Midwest 0.624 0.595 0.653 0
West-Midwest 0.218 0.177 0.259 0
South-Northeast 0.267 0.215 0.319 0
West-Northeast -0.139 -0.198 -0.079 0
West-South -0.406 -0.445 -0.366 0

The ANOVA test for income inequality produced an F-statistic of 1038.937 with a p-value that rounds to 0 (p < 0.001). Because the p-value is below the 0.05 significance level, we reject the null hypothesis that all regional means are equal and conclude that at least one region's mean differs from the others. We followed up with a Tukey HSD post-hoc test to determine which regions differ from one another. The results indicate that every pair of regions differs significantly in mean inequality, with the largest gap between the South and the Midwest (0.624).
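To make the ANOVA table easier to read, the sketch below rebuilds the F-statistic from the sums of squares and degrees of freedom reported for inequality. Because the inputs are rounded as displayed, the result matches the reported 1038.937 only approximately. The eta-squared line is an addition of ours for illustration, not part of the original analysis. The tail probability is computed on the log scale because it is too small to represent as an ordinary double, which is why the table shows 0.

```r
# Values copied from the ANOVA table for inequality (rounded as reported)
ss_region <- 1434;  df_region <- 3
ss_resid  <- 8492;  df_resid  <- 18461

ms_region <- ss_region / df_region   # mean square between regions
ms_resid  <- ss_resid / df_resid     # mean square within regions
f_stat    <- ms_region / ms_resid    # F = between / within

# Share of total variation associated with region (eta squared)
eta_sq <- ss_region / (ss_region + ss_resid)

# Upper-tail probability on the log scale to avoid underflow
log_p <- pf(f_stat, df_region, df_resid, lower.tail = FALSE, log.p = TRUE)

print(round(c(F = f_stat, eta_sq = eta_sq), 3))
print(log_p)
print(format.pval(exp(log_p), eps = 0.001))  # reporting-friendly p-value
```

Under these rounded inputs, region accounts for roughly 14 percent of the variation in inequality, so the very large F-statistic partly reflects the large sample size.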

# mental health days
aov_mh_days1 <- aov(mental_health_days ~ region, data = ranked)

summary_mh_days <- summary(aov_mh_days1)

#xkabledply(aov_mh_days, title = "ANOVA results: Poor Mental Health Days by Region")

# Reuse the fitted model and label the columns to match tidy()'s output
aov_mh_days <- tidy(aov_mh_days1)
aov_mh_days %>%
  kbl(caption="ANOVA Results: Poor Mental Health Days by Region",
       format= "html", col.names = c("Term", "Df", "Sum Sq", "Mean Sq", "F", "Pr(>F)"),
      align="r") %>%
   kable_classic_2(full_width = F, html_font = "helvetica")

ANOVA Results: Poor Mental Health Days by Region
Term Df Sum Sq Mean Sq F Pr(>F)
region 3 1471 490.347 1230 0
Residuals 18465 7363 0.399 NA NA

tukey_mh_days <- TukeyHSD(aov_mh_days1)
#tukey_mh_days

tukey_mh_days$region %>%
  kbl(caption="Tukey HSD Test: Poor Mental Health Days by Region",
       format= "html", col.names = c("Diff", "Lower", "Upper", "p adj"),
      align="r") %>%
   kable_classic_2(full_width = F, html_font = "helvetica")
Tukey HSD Test: Poor Mental Health Days by Region
Diff Lower Upper p adj
Northeast-Midwest 0.300 0.251 0.350 0
South-Midwest 0.625 0.598 0.652 0
West-Midwest 0.175 0.137 0.214 0
South-Northeast 0.325 0.276 0.373 0
West-Northeast -0.125 -0.180 -0.070 0
West-South -0.450 -0.486 -0.413 0
# frequent mental distress rate
aov_mh_rate1 <- aov(mental_distress_rate ~ region, data = ranked)

summary_mh_rate <- summary(aov_mh_rate1)

#xkabledply(aov_mh_rate, title = "ANOVA results: Frequent Mental Distress by Region")

# Reuse the fitted model and label the columns to match tidy()'s output
aov_mh_rate <- tidy(aov_mh_rate1)
aov_mh_rate %>%
  kbl(caption="ANOVA Results: Frequent Mental Distress by Region",
       format= "html", col.names = c("Term", "Df", "Sum Sq", "Mean Sq", "F", "Pr(>F)"),
      align="r") %>%
   kable_classic_2(full_width = F, html_font = "helvetica")

ANOVA Results: Frequent Mental Distress by Region
Term Df Sum Sq Mean Sq F Pr(>F)
region 3 1.64 0.547 1151 0
Residuals 18465 8.78 0.000 NA NA

tukey_mh_rate <- TukeyHSD(aov_mh_rate1)
#tukey_mh_rate

tukey_mh_rate$region %>%
  kbl(caption="Tukey HSD Test: Frequent Mental Distress by Region",
       format= "html", col.names = c("Diff", "Lower", "Upper", "p adj"),
      align="r") %>%
   kable_classic_2(full_width = F, html_font = "helvetica")
Tukey HSD Test: Frequent Mental Distress by Region
Diff Lower Upper p adj
Northeast-Midwest 0.005 0.003 0.006 0.000
South-Midwest 0.020 0.019 0.021 0.000
West-Midwest 0.005 0.004 0.006 0.000
South-Northeast 0.016 0.014 0.017 0.000
West-Northeast 0.000 -0.002 0.002 0.996
West-South -0.016 -0.017 -0.014 0.000

The ANOVA test for poor mental health days produced an F-statistic of 1229.661 with a p-value that rounds to 0 (p < 0.001). Again, we rejected the null hypothesis and followed up with a post-hoc Tukey HSD test. The Tukey results indicate that every pair of regions has significantly different means for mental_health_days.

The ANOVA test for the frequent mental distress rate produced an F-statistic of 1150.962, again with a p-value that rounds to 0. We rejected the null hypothesis, and the Tukey results indicate that all pairs of regions have significantly different means for mental_distress_rate except the Northeast and the West (adjusted p = 0.996).

Spearman’s Rank Correlation

We also conducted Spearman’s rank correlation tests to see whether the mental health variables are associated with income inequality and median income.

spearman_ineq_v_days <- cor.test(ranked$inequality, ranked$mental_health_days, method="spearman", exact=F)
#spearman_ineq_v_days

corr_table1 <- data.frame(estimate = c(spearman_ineq_v_days$estimate))
corr_table1 <- rbind(corr_table1, spearman_ineq_v_days$statistic)
corr_table1 <- rbind(corr_table1, spearman_ineq_v_days$p.value)
rownames(corr_table1) <- c("Sample Estimates: rho", "Test-Statistic: S", "p-value")

corr_table1 %>%
  kbl(caption="Spearman's Rank Correlation rho: Mental Health Days v. Inequality",
       format= "html", col.names = c(""),
      align="r") %>%
   kable_classic_2(full_width = F, html_font = "helvetica")

spearman_ineq_v_rate <- cor.test(ranked$inequality, ranked$mental_distress_rate, method="spearman", exact=F)
#spearman_ineq_v_rate

corr_table2 <- data.frame(estimate = c(spearman_ineq_v_rate$estimate))
corr_table2 <- rbind(corr_table2, spearman_ineq_v_rate$statistic)
corr_table2 <- rbind(corr_table2, spearman_ineq_v_rate$p.value)
rownames(corr_table2) <- c("Sample Estimates: rho", "Test-Statistic: S", "p-value")

corr_table2 %>%
  kbl(caption="Spearman's Rank Correlation rho: Mental Distress Rate v. Inequality",
       format= "html", col.names = c(""),
      align="r") %>%
   kable_classic_2(full_width = F, html_font = "helvetica")


spearman_medinc_v_days <- cor.test(ranked$median_inc, ranked$mental_health_days, method="spearman", exact=F)
#spearman_medinc_v_days

corr_table3 <- data.frame(estimate = c(spearman_medinc_v_days$estimate))
corr_table3 <- rbind(corr_table3, spearman_medinc_v_days$statistic)
corr_table3 <- rbind(corr_table3, spearman_medinc_v_days$p.value)
rownames(corr_table3) <- c("Sample Estimates: rho", "Test-Statistic: S", "p-value")

corr_table3 %>%
  kbl(caption="Spearman's Rank Correlation rho: Mental Health Days v. Median Income",
       format= "html", col.names = c(""),
      align="r") %>%
   kable_classic_2(full_width = F, html_font = "helvetica")


spearman_medinc_v_rate <- cor.test(ranked$median_inc, ranked$mental_distress_rate, method="spearman", exact=F)
#spearman_medinc_v_rate

corr_table4 <- data.frame(estimate = c(spearman_medinc_v_rate$estimate))
corr_table4 <- rbind(corr_table4, spearman_medinc_v_rate$statistic)
corr_table4 <- rbind(corr_table4, spearman_medinc_v_rate$p.value)
rownames(corr_table4) <- c("Sample Estimates: rho", "Test-Statistic: S", "p-value")

corr_table4 %>%
  kbl(caption="Spearman's Rank Correlation rho: Mental Distress Rate v. Median Income",
       format= "html", col.names = c(""),
      align="r") %>%
   kable_classic_2(full_width = F, html_font = "helvetica")

The Spearman correlation tests are summarized in the following table:


Variables rho p-value
inequality vs. mental_health_days 0.399 0
inequality vs. mental_distress_rate 0.43 0
median_inc vs. mental_health_days -0.45 0
median_inc vs. mental_distress_rate -0.497 0

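Because all four p-values round to 0, well below 0.05, we reject the null hypothesis in each case and conclude that the correlations are significantly different from zero. Inequality is moderately positively correlated with both mental health measures (rho of 0.399 and 0.43), while median income is moderately negatively correlated with them (rho of -0.45 and -0.497). These findings suggest a real, moderate-strength association between mental health and both income inequality and median income.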
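For readers less familiar with the statistic, Spearman's rho is the Pearson correlation computed on the ranks of the data, so it captures monotonic relationships that need not be linear. The sketch below checks this equivalence on simulated data. The variables x and y and the seed are hypothetical and unrelated to our data set.

```r
# Hypothetical monotonic but non-linear relationship (no ties)
set.seed(7)
x <- runif(500, 3, 6)
y <- exp(x) + rnorm(500, sd = 20)

# Spearman's rho from cor.test, as used in our analysis
rho_builtin <- unname(cor.test(x, y, method = "spearman", exact = FALSE)$estimate)

# The same quantity computed as the Pearson correlation of the ranks
rho_by_rank <- cor(rank(x), rank(y))

# Ordinary Pearson correlation on the raw values, for comparison
r_pearson <- cor(x, y)

print(c(spearman = rho_builtin, rank_pearson = rho_by_rank, pearson = r_pearson))
```

Working on ranks is what makes the test a reasonable choice when variables are skewed, as some of ours are.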

Conclusion

Here we summarize our results and conclude with a clear sense of what we’ve learned. We should re-state our SMART question(s) and highlight how they have changed or evolved.

  • As the research by Tibber et al. (2022) showed, controlling for economic deprivation is important
  • info based on medical records

Are there policy differences between the South and the Midwest that explain their differing outcomes?

The most progress could be made in the South, which has the highest mean income inequality and the poorest mean mental health measures of the four regions.

Therefore, we think the government should commit to strengthening economic development in the South, raising the incomes of residents there, and reducing income inequality across the United States.

References

Bechtel, L., Lordan, G., & Rao, D. P. (2012). Income inequality and mental health—empirical evidence from Australia. Health Economics, 21, 4-17.

Kelley, J., & Evans, M. D. (2017). Societal inequality and individual subjective well-being: Results from 68 societies and over 200,000 individuals, 1981–2008. Social Science Research, 62, 1-23.

Layte, R. (2012). The association between income inequality and mental health: testing status anxiety, social capital, and neo-materialist explanations. European Sociological Review, 28(4), 498-511.

Matthew, P., & Brodersen, D. M. (2018). Income inequality and health outcomes in the United States: An empirical analysis. The Social Science Journal, 55(4), 432-442.

Pickett, K. E., & Wilkinson, R. G. (2015). Income inequality and health: a causal review. Social Science & Medicine, 128, 316-326.

Ribeiro, W. S., Bauer, A., Andrade, M. C. R., York-Smith, M., Pan, P. M., Pingani, L., Coutinho, E. S. F., & Evans-Lacko, S. (2017). Income inequality and mental illness-related morbidity and resilience: a systematic review and meta-analysis. The Lancet Psychiatry, 4(7), 554-562.

Robert Wood Johnson Foundation (RWJF) (2021). 2021 County Health Rankings National Data. County Health Rankings & Roadmaps. https://www.countyhealthrankings.org/explore-health-rankings/rankings-data-documentation. Accessed: March 13, 2022.

Robert Wood Johnson Foundation (RWJF) (2021). County Health Rankings Model. County Health Rankings & Roadmaps. https://www.countyhealthrankings.org/explore-health-rankings/measures-data-sources/county-health-rankings-model. Accessed: March 13, 2022.

Sommet, N., Morselli, D., & Spini, D. (2018). Income inequality affects the psychological health of only the people facing scarcity. Psychological Science, 29(12), 1911-1921.

Tibber, M. S., Walji, F., Kirkbride, J. B., & Huddy, V. (2022). The association between income inequality and adult mental health at the subnational level—a systematic review. Social Psychiatry and Psychiatric Epidemiology, 57(1), 1-24.

U.S. Census Bureau. (2010). Census Regions and Divisions of the United States. U.S. Department of Commerce Economics and Statistics Administration. https://www2.census.gov/geo/pdfs/maps-data/maps/reference/us_regdiv.pdf. Accessed: March 13, 2022.

Zimmerman, F. J., & Bell, J. F. (2006). Income inequality and physical and mental health: testing associations consistent with proposed causal pathways. Journal of Epidemiology & Community Health, 60(6), 513-521.